Abstract: Lip synchronization is the key issue in speech-driven face animation systems. In this paper, clustering and machine learning methods are combined to estimate face animation parameters (FAPs) from audio sequences, and the learned mapping is applied to an MPEG-4 based speech-driven face animation system. Based on a large recorded audio-visual database, an unsupervised clustering algorithm is proposed to obtain basic FAP patterns that describe the characteristics of face motion. An artificial neural network (ANN) is then trained to map the cepstral coefficients of an individual's natural speech to these FAP patterns, avoiding the potential limitations of an intermediate speech-recognition step; the network output can be used to drive the articulation of the synthetic face directly. Two approaches to evaluation are also proposed: quantitative and qualitative. The system's performance shows that the proposed learning algorithm is effective and greatly improves the realism of face animation during speech. Moreover, the MPEG-4 based learning results are suitable for driving many different kinds of animation, ranging from video-realistic image warps to 3D cartoon characters.
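The two-stage pipeline the abstract describes (cluster FAP vectors into basic patterns, then learn an audio-to-pattern mapping) could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the paper's method: plain k-means stands in for the unspecified unsupervised clustering, multinomial softmax regression stands in for the ANN, and all data shapes (68 FAP values per frame, 13 cepstral coefficients) and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data (the paper uses a recorded audio-visual
# database): 68 MPEG-4 FAP values per frame, drawn from three synthetic
# "viseme-like" mouth configurations.
fap_centers = rng.normal(size=(3, 68))
faps = np.vstack([c + 0.05 * rng.normal(size=(200, 68)) for c in fap_centers])

# Step 1: unsupervised clustering into basic FAP patterns.  Plain k-means
# with farthest-point seeding stands in for the paper's clustering algorithm.
def kmeans(X, k, iters=50):
    cent = [X[0]]
    for _ in range(k - 1):                       # farthest-point seeding
        d = ((X[:, None] - np.array(cent)[None]) ** 2).sum(-1).min(1)
        cent.append(X[int(np.argmax(d))])
    cent = np.array(cent)
    for _ in range(iters):                       # Lloyd iterations
        labels = ((X[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
        cent = np.array([X[labels == j].mean(0) for j in range(k)])
    return cent, labels

patterns, labels = kmeans(faps, k=3)             # basic FAP patterns

# Step 2: map per-frame audio features to a FAP pattern.  Softmax
# regression stands in for the paper's ANN; real inputs would be the
# cepstral coefficients of the same frames.
cep_centers = rng.normal(size=(3, 13))           # 13 cepstral coefficients
cepstra = cep_centers[labels] + 0.2 * rng.normal(size=(600, 13))

W, b = np.zeros((13, 3)), np.zeros(3)
onehot = np.eye(3)[labels]
for _ in range(300):                             # batch gradient descent
    logits = cepstra @ W + b
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    g = (p - onehot) / len(cepstra)
    W -= 0.5 * cepstra.T @ g
    b -= 0.5 * g.sum(0)

# Drive the synthetic face: audio frame -> pattern index -> FAP vector.
pred = (cepstra @ W + b).argmax(1)
accuracy = (pred == labels).mean()
predicted_faps = patterns[pred]                  # FAPs sent to the renderer
print(f"pattern-recovery accuracy: {accuracy:.2f}")
```

Because the output of the learned mapping is itself a FAP vector, the same predictions can feed any MPEG-4 compliant renderer, which is what makes the approach independent of the final animation style.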