WACV 2025

EmoVOCA: Speech-Driven Emotional 3D Talking Heads

Federico Nocentini1, Claudio Ferrari2, Stefano Berretti1
1Media Integration and Communication Center (MICC), University of Florence, Italy, 2Department of Architecture and Engineering, University of Parma, Italy
[Figure: the EmoVOCA idea]


We introduce EmoVOCA, a novel approach for generating a synthetic dataset of emotional 3D talking heads, which leverages speech tracks, intensity labels, emotion labels, and actor specifications. The proposed dataset can be used to overcome the lack of 3D datasets of expressive speech, and to train more accurate emotional 3D talking head generators than methods relying on 2D data as a proxy.

Abstract

The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field lies in blending speech-related motions with expression dynamics, primarily due to the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. While previous works attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address this problem from a different perspective, and propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads with a set of 3D expressive sequences. To demonstrate the advantages of this approach, and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate audio-synchronized lip movements together with the expressive traits of the face. Comprehensive quantitative and qualitative experiments using our data and generator demonstrate a superior ability to synthesize convincing animations compared with the best-performing methods in the literature.
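To make the generator's interface concrete, below is a minimal sketch in PyTorch of a model taking the four inputs named in the abstract (a neutral 3D face, per-frame audio features, an emotion label, and an intensity value) and producing an animated vertex sequence. All module names, layer choices, feature dimensions, and the emotion list are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

# Hypothetical emotion vocabulary; the actual label set depends on the data.
EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised", "fearful", "disgusted"]

class EmotionalTalkingHeadGenerator(nn.Module):
    def __init__(self, n_verts=5023, audio_dim=768, latent=256):
        super().__init__()
        self.emotion_emb = nn.Embedding(len(EMOTIONS), latent)
        self.audio_proj = nn.Linear(audio_dim, latent)  # e.g., wav2vec-style features (assumption)
        self.fuse = nn.GRU(latent, latent, batch_first=True)
        self.head = nn.Linear(latent, n_verts * 3)      # per-frame vertex displacements

    def forward(self, template, audio_feats, emotion_id, intensity):
        # template:    (V, 3) neutral face vertices
        # audio_feats: (T, audio_dim) per-frame speech features
        # emotion_id:  int index into EMOTIONS
        # intensity:   float in [0, 1] scaling the emotional style
        style = self.emotion_emb(torch.tensor(emotion_id)) * intensity
        x = self.audio_proj(audio_feats) + style        # condition audio on emotion
        h, _ = self.fuse(x.unsqueeze(0))
        disp = self.head(h).view(-1, template.shape[0], 3)
        return template.unsqueeze(0) + disp             # (T, V, 3) animated sequence

# Example call (shapes only):
# gen = EmotionalTalkingHeadGenerator()
# seq = gen(template, audio_feats, EMOTIONS.index("happy"), intensity=0.8)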



Proposed Method

[Figure: framework overview]


Overview of our framework. Two distinct encoders separately process the talking and expressive 3D head displacements, while a common decoder is trained to reconstruct them. At inference, talking and emotional heads are combined by concatenating their encoded latent vectors, and the decoder outputs a combination of their displacements.
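A minimal PyTorch sketch of this dual-encoder / shared-decoder scheme follows. All names, dimensions, and layer choices are illustrative assumptions; in particular, duplicating each latent during training so it fills both decoder slots is one plausible way to match the concatenated inference input, not necessarily the authors' choice.

import torch
import torch.nn as nn

N_VERTS = 5023   # vertices of the head mesh topology (assumed value)
LATENT = 256     # per-stream latent size (assumed value)

def mlp(d_in, d_out):
    # Hypothetical per-frame encoder/decoder block; the paper's layers may differ.
    return nn.Sequential(nn.Linear(d_in, 512), nn.ReLU(), nn.Linear(512, d_out))

class DisplacementAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Two distinct encoders: one for speech-only (talking) displacements,
        # one for expressive (emotional) displacements.
        self.enc_talk = mlp(N_VERTS * 3, LATENT)
        self.enc_expr = mlp(N_VERTS * 3, LATENT)
        # One shared decoder; it takes two concatenated latents so the talking
        # and expressive codes can be fused at inference time.
        self.dec = mlp(2 * LATENT, N_VERTS * 3)

    def forward(self, d_talk, d_expr):
        # Training: each stream is encoded and reconstructed independently
        # through the common decoder (latent duplicated to fill both slots).
        z_t = self.enc_talk(d_talk)
        z_e = self.enc_expr(d_expr)
        rec_talk = self.dec(torch.cat([z_t, z_t], dim=-1))
        rec_expr = self.dec(torch.cat([z_e, z_e], dim=-1))
        return rec_talk, rec_expr

    @torch.no_grad()
    def combine(self, d_talk, d_expr):
        # Inference: concatenate the talking and expressive latents so the
        # decoder outputs displacements carrying both lip motion and emotion.
        z = torch.cat([self.enc_talk(d_talk), self.enc_expr(d_expr)], dim=-1)
        return self.dec(z)

Training would minimize a reconstruction loss (e.g., an L2 penalty on the displacements) for each stream; sequences produced by combine() would then serve as synthetic expressive-speech samples.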

Intro Video

Qualitative Examples

BibTeX


      @inproceedings{nocentini2024emovocaspeechdrivenemotional3d,
        title     = {EmoVOCA: Speech-Driven Emotional 3D Talking Heads},
        author    = {Federico Nocentini and Claudio Ferrari and Stefano Berretti},
        booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
        year      = {2025},
      }