Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation

¹Fudan University, ²Youtu Lab, Tencent, China
* Indicates Equal Contribution
TL;DR: We propose DICE-Talk, a new framework for generating talking head videos with vivid, identity-preserving emotional expressions.

Generated Videos

DICE-Talk produces vivid and diverse emotions for speaking portraits. The images and audio clips are collected from recent works or sourced from the Internet.

Abstract

Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework, dubbed DICE-Talk, that follows the idea of disentangling identity from emotion and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable emotion banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on the MEAD and HDTF datasets demonstrate our method's superiority: it outperforms state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
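
To make the first component concrete, below is a minimal PyTorch-style sketch (our illustration under assumed shapes, not the paper's released code): audio features attend to visual features through cross-modal attention, and the pooled result parameterizes a Gaussian from which an identity-agnostic emotion embedding is sampled. All module names, feature dimensions, and the mean-pooling choice are assumptions.

import torch
import torch.nn as nn

class DisentangledEmotionEmbedder(nn.Module):
    # Hypothetical sketch: fuse audio-visual emotional cues with cross-modal
    # attention and represent the emotion as a Gaussian distribution, so the
    # sampled embedding carries affect statistics rather than identity.
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_mu = nn.Linear(dim, dim)      # mean of the emotion Gaussian
        self.to_logvar = nn.Linear(dim, dim)  # log-variance of the Gaussian

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T_a, dim); visual_feats: (B, T_v, dim)
        fused, _ = self.cross_attn(audio_feats, visual_feats, visual_feats)
        pooled = fused.mean(dim=1)  # clip-level emotional cue
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        # Reparameterization: sample an identity-agnostic emotion embedding.
        emb = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return emb, mu, logvar

A KL term against a standard normal, as in a VAE, would typically regularize mu and logvar during training.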

Framework

Framework of DICE-Talk. Our method comprises three key components: disentangled emotion embedder, correlation-enhanced emotion conditioning, and emotion discrimination objective. These architectural elements work synergistically to decouple identity representations from emotional cues while preserving facial articulation details, thereby generating lifelike animated portraits with emotionally nuanced expressions.
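
The second and third components can be sketched in the same spirit (again our illustration, with assumed bank size, shapes, and classifier interface): a learnable emotion bank is queried via vector quantization to fetch the nearest prototype, attention over the whole bank aggregates features from correlated emotions, and a lightweight latent-space classifier supplies the emotion discrimination loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationEnhancedConditioning(nn.Module):
    # Hypothetical sketch of a learnable emotion bank: vector quantization
    # snaps the query to its nearest entry, while soft attention over the
    # full bank mixes in features from related emotions.
    def __init__(self, dim=256, bank_size=32):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(bank_size, dim))
        self.scale = dim ** 0.5

    def forward(self, emotion_emb):                      # (B, dim)
        dists = torch.cdist(emotion_emb, self.bank)      # (B, bank_size)
        quantized = self.bank[dists.argmin(dim=1)]       # nearest prototype
        # Straight-through estimator keeps gradients flowing to the encoder.
        quantized = emotion_emb + (quantized - emotion_emb).detach()
        # Attention over the bank captures inter-emotion correlations.
        attn = F.softmax(emotion_emb @ self.bank.t() / self.scale, dim=-1)
        return quantized + attn @ self.bank              # conditioning signal

def emotion_discrimination_loss(latents, labels, classifier):
    # Latent-space classification: a small classifier predicts the emotion
    # label from intermediate diffusion latents; its cross-entropy pushes
    # denoising toward affectively consistent frames.
    logits = classifier(latents.flatten(1))              # (B, num_emotions)
    return F.cross_entropy(logits, labels)

In practice such a discrimination loss would be added, with some weight, to the standard diffusion denoising objective.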

BibTeX

@misc{tan2025disentangleidentitycooperateemotion,
      title={Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation}, 
      author={Weipeng Tan and Chuming Lin and Chengming Xu and FeiFan Xu and Xiaobin Hu and Xiaozhong Ji and Junwei Zhu and Chengjie Wang and Yanwei Fu},
      year={2025},
      eprint={2504.18087},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.18087}, 
}