Abstract
Background
Visual Emotion Recognition (VER), a fundamental task in artificial intelligence and human-computer interaction, aims to identify human emotions from visual cues such as facial expressions, body language, and contextual scene features. Existing VER methods typically focus on realistic images and have achieved considerable progress on datasets such as EmoSet and Emotion6. However, emotional content increasingly appears in other visual domains, such as stickers, and current VER models cannot handle emotion recognition in these domains due to significant variability in emotional expression across domains and an affective distribution shift. In this paper, we introduce a new and challenging task, Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER), in which a model is trained on labeled source-domain data (e.g., realistic images) and unlabeled target-domain data (e.g., stickers), and is then deployed to recognize emotions in the target domain.
Motivation
Taking stickers and realistic images as an example (Figure 1), two key challenges arise:
- Emotional expression variability: Emotional expressions vary greatly across domains. Realistic images reflect emotions expressed by real humans, while stickers exaggerate or simplify emotions, often conveying them through one or more virtual elements.
- Affective distribution shift: According to the Emotional Causality theory, an emotion is embedded in a sequence involving (i) an external event, (ii) an emotional state, and (iii) a physiological response. Stickers and emojis emphasize the last two, i.e., (ii) and (iii), while the emotion in realistic images is often linked to the external context surrounding the subject(s).
Methods
We propose a Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework, which projects affective cues into a domain-agnostic knowledge space and performs domain-adaptive visual affective perception by a diffusion model.
Briefly, KCDP comprises two primary modules: Knowledge-Alignment Diffusion Affective Perception (KADAP) and Counterfactual-Enhanced Language-Image Emotional Alignment (CLIEA). The **KADAP** module focuses on learning domain-agnostic knowledge and making robust predictions with a mixture-of-experts (MoE) predictor, while the **CLIEA** module generates high-quality pseudo-labels for effective training.
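The paper does not spell out the MoE predictor's internals here, but the general mechanism of a mixture-of-experts head can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimension, the linear experts, and the softmax gate are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_predict(features, expert_weights, gate_weights):
    """Combine per-expert emotion logits with a learned soft gate.

    features:       (d,) domain-agnostic feature vector
    expert_weights: list of (d, n_classes) matrices, one per expert (assumed linear)
    gate_weights:   (d, n_experts) matrix producing gating logits
    """
    gates = softmax(features @ gate_weights)                          # (n_experts,)
    expert_logits = np.stack([features @ W for W in expert_weights])  # (E, C)
    mixed = gates @ expert_logits                                     # (C,) gated mixture
    return softmax(mixed)                                             # class probabilities
```

The gate lets the model weight experts differently per input, which is one common way to keep predictions robust when inputs come from heterogeneous domains.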
The CLIEA strategy generates high-quality emotional pseudo-labels for the target domain, and is inspired by the causal relationships underlying language-image emotional alignment.
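One standard way to realize counterfactual enhancement for pseudo-labeling is to subtract a counterfactual baseline score (e.g., the similarity a content-free input would get) from the factual language-image similarity, then keep only confident predictions. The sketch below illustrates that generic idea; the debiasing rule, the baseline input, and the threshold `tau` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def counterfactual_pseudo_labels(img_text_sim, baseline_sim, tau=0.7):
    """Debias similarities with a counterfactual baseline, keep confident labels.

    img_text_sim: (N, C) image-to-emotion-prompt similarities (factual outcome)
    baseline_sim: (C,)   similarities for a content-free baseline input
                         (counterfactual outcome; captures prompt/class bias)
    tau:          confidence threshold for accepting a pseudo-label
    """
    debiased = img_text_sim - baseline_sim                  # remove class-level bias
    e = np.exp(debiased - debiased.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)                # per-image distribution
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[conf < tau] = -1                                 # -1 marks "discard"
    return labels
```

Filtering by confidence means only images whose debiased prediction is decisive contribute pseudo-labels to target-domain training.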
Citation
@inproceedings{ucdver,
  title={Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition},
  author={Yin, Wen and Wang, Yong and Duan, Guiduo and Zhang, Dongyang and Hu, Xin and Li, Yuan-Fang and He, Tao},
  booktitle={CVPR},
  year={2025}
}