Multimodal Emotion Recognition in Multi-Cultural Conditions
Author: Chen Shi-Zhe, Wang Shuai, Jin Qin
Affiliation:

Fund Project: National Key Research and Development Program of China (2016YFB1001200)

    Abstract:

    Automatic emotion recognition is a challenging task with a wide range of applications. This paper addresses emotion recognition in multi-cultural conditions. Multimodal features are extracted from the audio and visual modalities, and the recognition performance of hand-crafted features is compared with that of features automatically learned by deep neural networks. Multimodal feature fusion is also explored to combine the different modalities. The CHEAVD Chinese and the AFEW English multimodal emotion datasets are used to evaluate the proposed methods. Cross-culture experiments demonstrate the importance of the culture factor for emotion recognition, and three strategies are then developed to improve recognition performance in the multi-cultural environment: selecting the corresponding emotion model for each culture, jointly training with multi-cultural datasets, and embedding features from multi-cultural datasets into a shared emotion space. The embedding strategy separates the culture influence from the original features and generates more discriminative emotion features, achieving the best performance for both acoustic and multimodal emotion recognition.
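    The multimodal fusion mentioned in the abstract can be illustrated with a minimal decision-level (late) fusion sketch: each modality's classifier outputs emotion-class probabilities, and the fused prediction is a weighted average of those distributions. The emotion labels, weights, and probability values below are illustrative assumptions, not taken from the paper, which may fuse modalities differently.

```python
# Hypothetical late-fusion sketch: combine per-modality emotion
# probability vectors with a normalized weighted average, then pick
# the highest-scoring class. All names and numbers are illustrative.

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def late_fusion(modality_probs, weights):
    """Weighted average of per-modality probability vectors."""
    assert len(modality_probs) == len(weights)
    total = sum(weights)
    fused = [0.0] * len(EMOTIONS)
    for probs, w in zip(modality_probs, weights):
        for i, p in enumerate(probs):
            fused[i] += (w / total) * p
    return fused

def predict(fused):
    """Return the emotion label with the highest fused score."""
    return EMOTIONS[max(range(len(fused)), key=lambda i: fused[i])]

# Example: the audio model leans toward "sad", the visual model
# toward "neutral"; weighting audio higher lets "sad" win.
audio = [0.1, 0.1, 0.3, 0.5]
visual = [0.1, 0.2, 0.4, 0.3]
fused = late_fusion([audio, visual], weights=[0.6, 0.4])
print(predict(fused))  # -> sad
```

    In practice the modality weights would be tuned on a validation set; feature-level (early) fusion, which the abstract also mentions, would instead concatenate the audio and visual feature vectors before classification.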

Citation: Chen SZ, Wang S, Jin Q. Multimodal emotion recognition in multi-cultural conditions. Ruan Jian Xue Bao/Journal of Software, 2018,29(4):1060-1070 (in Chinese).
History
  • Received: April 30, 2017
  • Revised: June 26, 2017
  • Online: November 29, 2017
Copyright: Institute of Software, Chinese Academy of Sciences