Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025
1 The Laboratory of Intelligent Collaborative Computing of UESTC
2 Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province
3 Faculty of Information Technology, Monash University
* Corresponding author

Abstract

Visual Emotion Recognition (VER) is a critical yet challenging task aimed at inferring the emotional states of individuals from visual cues. However, recent approaches predominantly focus on a single domain, e.g., realistic images or stickers, limiting VER models' cross-domain generalizability. To address this limitation, we introduce the Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task, which aims to generalize visual emotion recognition from a source domain (e.g., realistic images) to a low-resource target domain (e.g., stickers) in an unsupervised manner. Compared to conventional unsupervised domain adaptation problems, UCDVER presents two key challenges: significant emotional expression variability and an affective distribution shift. To mitigate these issues, we propose the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework for UCDVER. Specifically, KCDP leverages a vision-language model to align emotional representations in a shared knowledge space and guides diffusion models for improved visual affective perception. Furthermore, a Counterfactual-Enhanced Language-image Emotional Alignment (CLIEA) method generates high-quality pseudo-labels for the target domain. Extensive experiments demonstrate that our approach surpasses state-of-the-art models in both perceptibility and generalization, e.g., achieving a 12% improvement over the SOTA VER model TGCA-PVT.

Background

Visual Emotion Recognition (VER), a fundamental task in artificial intelligence and human-computer interaction, aims to identify human emotions from visual cues such as facial expressions, body language, and contextual scene features. Existing VER methods typically focus on realistic images and have achieved considerable advances on a suite of datasets such as EmoSet and Emotion6. Unfortunately, current VER models cannot handle emotion recognition in new domains such as stickers, owing to the significant emotional expression variability between domains and an affective distribution shift. In this paper, we introduce a new, challenging task, Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER), where a model is trained on labeled source-domain data (e.g., realistic images) and unlabeled target-domain data (e.g., stickers), but is deployed to recognize emotions in the target domain.

Motivation

Taking stickers and realistic images as an example (Figure 1), two key challenges arise:

  • Emotional expression variability: Emotional expressions vary greatly across domains. Realistic images convey emotions expressed by real humans, while stickers exaggerate or simplify emotions, often centering on one or a few virtual elements.
  • Affective distribution shift: According to the Emotional Causality theory, an emotion is embedded in a sequence involving (i) an external event, (ii) an emotional state, and (iii) a physiological response. Stickers or emojis emphasize the last two, i.e., (ii) and (iii), while the emotion in realistic images is often linked to the external context surrounding the subject(s).

[Figure 1: Emotional expression variability and affective distribution shift between realistic images and stickers]

Methods

We propose a Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework, which projects affective cues into a domain-agnostic knowledge space and performs domain-adaptive visual affective perception by a diffusion model.

Briefly, KCDP comprises two primary modules: Knowledge-Alignment Diffusion Affective Perception (KADAP) and Counterfactual-Enhanced Language-Image Emotional Alignment (CLIEA). The KADAP module learns domain-agnostic knowledge and makes robust predictions via a mixture-of-experts (MoE) predictor, while the CLIEA module generates high-quality pseudo-labels for effective training.
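To make the MoE prediction step concrete, the sketch below shows how per-expert emotion logits can be combined with softmax gating weights. This is a minimal, illustrative toy in pure Python, not the paper's implementation: the function names, the number of experts, and the toy inputs are all hypothetical, and the domain-agnostic feature extraction that would produce the gating scores is omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def moe_predict(gate_scores, expert_logits):
    """Combine per-expert emotion logits with softmax gating weights.

    gate_scores:   one score per expert (hypothetically derived from
                   domain-agnostic features).
    expert_logits: one logit vector per expert, all the same length.
    Returns a probability distribution over emotion classes.
    """
    weights = softmax(gate_scores)
    num_classes = len(expert_logits[0])
    combined = [0.0] * num_classes
    for w, logits in zip(weights, expert_logits):
        for c in range(num_classes):
            combined[c] += w * logits[c]
    return softmax(combined)

# Toy example: 2 experts, 3 emotion classes (values are illustrative).
gate = [2.0, 0.5]
experts = [[1.0, 0.2, -0.5],
           [0.1, 1.5, 0.0]]
probs = moe_predict(gate, experts)
```

Because expert 1 receives most of the gating weight here, the combined distribution is dominated by its logits; in the actual framework, the gating would instead be learned from the knowledge-aligned features.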

[Figure: Overview of the KCDP framework]

The CLIEA strategy generates high-quality emotional pseudo-labels for the target domain; it is inspired by the causal relationships underlying language-image emotional alignment.
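The core of language-image pseudo-labeling can be sketched as follows: score a target-domain image embedding against text embeddings of the emotion labels, and keep only confident assignments. This toy in pure Python is an assumption-laden illustration, not CLIEA itself: the 2-D embeddings, the confidence threshold of 0.6, and the function names are hypothetical, and the counterfactual enhancement that distinguishes CLIEA from plain similarity-based labeling is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pseudo_label(image_emb, text_embs, threshold=0.6):
    """Return the index of the best-matching emotion text embedding,
    or None when the softmax confidence falls below the threshold
    (so unconfident samples are excluded from training)."""
    sims = [cosine(image_emb, t) for t in text_embs]
    m = max(sims)
    es = [math.exp(s - m) for s in sims]
    total = sum(es)
    probs = [e / total for e in es]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

# Toy 2-D text anchors for two emotions (hypothetical placeholders).
text_anchors = [[1.0, 0.0], [0.0, 1.0]]
confident = pseudo_label([0.9, 0.1], text_anchors)  # clearly closer to anchor 0
uncertain = pseudo_label([0.5, 0.5], text_anchors)  # equidistant, filtered out
```

Thresholding the confidence is what keeps the pseudo-labels "high quality": ambiguous target-domain samples are simply dropped rather than trained on with noisy labels.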


Citation

@inproceedings{ucdver,
  title={Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition},
  author={Wen Yin and Yong Wang and Guiduo Duan and Dongyang Zhang and Xin Hu and Yuan-Fang Li and Tao He},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}