Instruction-Driven 3D Facial Expression Generation and Transition

¹Department of Computer Engineering, Sejong University, Seoul, Republic of Korea; ²School of Computer Science and Technology, Anhui University, Hefei, China

Overview

[Overview figure]

Abstract

A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms its expression from one designated facial expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instructions. Because text prompts can describe human emotional states in diverse ways, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications.
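The abstract mentions a vertex reconstruction loss but this page does not give its formula. As a rough illustration only, the sketch below assumes the loss is the mean squared Euclidean distance between corresponding predicted and ground-truth mesh vertices. The function name `vertex_reconstruction_loss` and this exact definition are assumptions for illustration, not the authors' implementation.

```python
from typing import Sequence

Vertex = Sequence[float]  # one (x, y, z) mesh vertex


def vertex_reconstruction_loss(pred: Sequence[Vertex],
                               target: Sequence[Vertex]) -> float:
    """Mean squared Euclidean distance between corresponding vertices.

    Assumed form of the loss (hypothetical): the page only names a
    "vertex reconstruction loss" without defining it.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("meshes must be non-empty and have the same vertex count")
    total = 0.0
    for p, t in zip(pred, target):
        # Squared distance between one predicted vertex and its target.
        total += sum((pc - tc) ** 2 for pc, tc in zip(p, t))
    return total / len(pred)


# Two-vertex toy mesh: squared distances are 1 and 4, so the mean is 2.5.
loss = vertex_reconstruction_loss([(1, 0, 0), (1, 1, 3)],
                                  [(0, 0, 0), (1, 1, 1)])
```

In a training setup, a loss of this kind would typically be computed per frame on the generated mesh sequence and averaged over the sequence.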

INTERPOLATION

Instruction: Transform this face from disgust to sadness

Start Frame

End Frame
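To make the idea of a transition between a start frame and an end frame concrete, here is a minimal sketch of plain linear interpolation between two expression coefficient vectors. This is only a baseline illustration: the framework described on this page uses a learned Facial Expression Transition model, and the function name `interpolate_expressions` and the coefficient-vector representation are assumptions.

```python
from typing import List, Sequence


def interpolate_expressions(start: Sequence[float],
                            end: Sequence[float],
                            num_frames: int) -> List[List[float]]:
    """Linearly blend expression coefficients from `start` to `end`.

    Hypothetical baseline, not the learned transition model: frame 0
    equals `start`, the last frame equals `end`, and the frames in
    between are evenly spaced blends of the two.
    """
    if len(start) != len(end):
        raise ValueError("start and end must have the same number of coefficients")
    if num_frames < 2:
        raise ValueError("need at least a start frame and an end frame")
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # blend weight, 0.0 at start and 1.0 at end
        frames.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    return frames


# Three frames between two toy 2-coefficient expressions.
frames = interpolate_expressions([0.0, 2.0], [1.0, 0.0], 3)
# frames == [[0.0, 2.0], [0.5, 1.0], [1.0, 0.0]]
```

A linear blend like this moves every coefficient at the same constant rate, which is one reason a learned model may produce more natural-looking intermediate expressions.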


RESULTS

Facial Expression Transition


A comparison of our method and MotionCLIP for generating facial expression transitions on two benchmark datasets.

CK+

CelebV-HQ

Instruction: Turn this face from fear to surprise

Panels: Input Frame | MotionCLIP | Our method | MotionCLIP | Our method

Instruction: Replace this face from surprise to happiness

Panels: Input Frame | MotionCLIP | Our method | MotionCLIP | Our method

Instruction: Change this face from anger to sadness

Panels: Input Frame | MotionCLIP | Our method | MotionCLIP | Our method


Face Rendering

CK+

CelebV-HQ

Instruction: Turn this face from fear to surprise

Panels: Input Frame | Ours + ROME | Ours + CVTHead | Ours + ROME | Ours + CVTHead

Instruction: Replace this face from surprise to happiness

Panels: Input Frame | Ours + ROME | Ours + CVTHead | Ours + ROME | Ours + CVTHead

Instruction: Change this face from anger to sadness

Panels: Input Frame | Ours + ROME | Ours + CVTHead | Ours + ROME | Ours + CVTHead

The difference between results with and without TFED.

Instruction: Turn this face from sadness to anger

Panels: Input Frame | w/o TFED | w/ TFED

Instruction: Transform this face from fear to disgust

Panels: Input Frame | w/o TFED | w/ TFED

Instruction: Modify this face, changing it from happiness to surprise

Panels: Input Frame | w/o TFED | w/ TFED

with a diverse range of expressions and poses

Instruction: Modify this face changing it from fear to disgust

w/o Neutral Expression

w/ Neutral Expression

Panels: Input Frame | w/o NE | w/ NE

Instruction: Transform this face from anger to surprise

Panels: Input Frame | w/o NE | w/ NE

Instruction: Change this face from fear to contempt

Panels: Input Frame | w/o NE | w/ NE

with various instructions

Instruction: His face looks fearful, then transitions to disgust

Panels: Input Frame | Output

Instruction: This photo captures his expression shifting between contempt and anger

Panels: Input Frame | Output

Instruction: This face initially shows sadness and then surprise

Panels: Input Frame | Output

Instruction: He seems happy, but suddenly becomes angry

Panels: Input Frame | Output

BibTeX

@article{vo2025tmm,
  author    = {Vo, Anh H. and Kim, Tae-Seok and Jin, Hulin and Choi, Soo-Mi and Kim, Yong-Guk},
  title     = {Instruction-Driven 3D Facial Expression Generation and Transition},
  journal   = {IEEE Transactions on Multimedia},
  year      = {2025},
}