Review on Speech Emotion Recognition

Authors: HAN Wen-Jing, LI Hai-Feng, RUAN Hua-Bin, MA Lin

Funding: National Natural Science Foundation of China (61171186, 61271345); Open Fund of the MOE-Microsoft Key Laboratory of Language and Speech (HIT.KLOF.2011XXX); Fundamental Research Funds for the Central Universities (HIT.NSRIF.2012047)
Abstract:

This paper surveys the state of the art of speech emotion recognition (SER) and presents an outlook on future trends in SER technology. The survey summarizes and analyzes SER in detail from five perspectives: emotion description models, representative emotional speech corpora, extraction of emotion-related acoustic features, SER methods, and applications of SER technology. It focuses on summarizing, comparing, and analyzing the mainstream methods and recent advances in the field, aiming to provide researchers with a comprehensive and valuable academic reference. Finally, building on this analysis of the current state of research, the challenges and development trends facing SER are discussed.
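To make the surveyed pipeline concrete, below is a minimal illustrative sketch of the kind of SER system this review covers: global statistics of frame-level MFCC features feeding a discriminative classifier (an SVM). It is not the method of any one cited paper; the libraries (librosa, scikit-learn), file names, and emotion labels are assumptions for illustration only.

```python
import numpy as np
import librosa                      # assumed audio-analysis library
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def utterance_features(wav_path: str) -> np.ndarray:
    """Map a variable-length utterance to a fixed-length vector:
    mean and standard deviation of 13 MFCCs across all frames,
    a common global-statistics feature design in SER."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labeled training utterances (placeholders, not real data).
train_files = ["happy_01.wav", "angry_01.wav", "sad_01.wav"]
train_labels = ["happy", "angry", "sad"]

X = np.stack([utterance_features(f) for f in train_files])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, train_labels)

# Classify an unseen utterance.
print(clf.predict([utterance_features("unknown.wav")]))  # e.g. ['happy']
```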

Cite this article:

Han WJ, Li HF, Ruan HB, Ma L. Review on speech emotion recognition. Journal of Software, 2014,25(1):37-50 (in Chinese with English abstract).

History:
  • Received: 2013-05-08
  • Revised: 2013-09-02
  • Published online: 2013-11-04