Abstract: Automatic emotion recognition is a challenging task with a wide range of applications. This paper addresses the problem of emotion recognition under multi-cultural conditions. Multimodal features are extracted from the audio and visual modalities, and recognition performance is compared between hand-crafted features and features learned automatically by deep neural networks. Multimodal feature fusion is also explored to combine the different modalities. The CHEAVD Chinese multimodal emotion dataset and the AFEW English multimodal emotion dataset are used to evaluate the proposed methods. The importance of the culture factor is first demonstrated through cross-culture emotion recognition experiments, and three strategies are then developed to improve recognition performance in multi-cultural environments: selecting the emotion model corresponding to each culture, jointly training on multi-cultural datasets, and embedding features from the multi-cultural datasets into a shared emotion space. The embedding strategy separates the culture influence from the original features and generates more discriminative emotion features, yielding the best performance for both acoustic and multimodal emotion recognition.
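The abstract does not specify how the embedding strategy separates culture influence from the emotion features. One standard way to realize such culture-invariant embeddings, offered here purely as an illustrative sketch and not as the paper's actual implementation, is domain-adversarial training with a gradient reversal layer, where a culture classifier is trained adversarially against a shared embedding so that the embedding stays emotion-discriminative while becoming uninformative about culture. All class names, dimensions, and the two-culture setup below are assumptions.

```python
# Hypothetical sketch: culture-invariant emotion embedding via gradient reversal.
# Not the paper's implementation; architecture and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in backward."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed (scaled) gradient for x; no gradient for lamb.
        return -ctx.lamb * grad_output, None


class CultureInvariantEmbedder(nn.Module):
    """Shared embedding with an emotion head and an adversarial culture head."""

    def __init__(self, feat_dim, emb_dim, n_emotions, n_cultures, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.embed = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
        self.emotion_head = nn.Linear(emb_dim, n_emotions)
        self.culture_head = nn.Linear(emb_dim, n_cultures)

    def forward(self, x):
        z = self.embed(x)
        emo_logits = self.emotion_head(z)
        # The reversed gradient pushes z to be uninformative about culture.
        cul_logits = self.culture_head(GradReverse.apply(z, self.lamb))
        return emo_logits, cul_logits


# Usage: joint loss on a toy batch (two cultures, e.g. Chinese/English corpora).
model = CultureInvariantEmbedder(feat_dim=128, emb_dim=64, n_emotions=7, n_cultures=2)
x = torch.randn(8, 128)                      # stand-in for extracted A/V features
emo_y = torch.randint(0, 7, (8,))            # emotion labels
cul_y = torch.randint(0, 2, (8,))            # culture (dataset) labels
emo_logits, cul_logits = model(x)
loss = F.cross_entropy(emo_logits, emo_y) + F.cross_entropy(cul_logits, cul_y)
loss.backward()
```

Minimizing both terms jointly trains the culture head to predict culture while the reversed gradient trains the embedding to defeat it, one plausible mechanism for the "separating culture influence" behavior the abstract describes.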