OSA-LCM: Real-time One-Step Diffusion-based Expressive Portrait Videos Generation

Character: Generated by FLUX
Vocal Source: Online courses for legal exams

Character: Chinese Yuan Dynasty Emperors
Vocal Source: EMINEM - GODZILLA

Character: Leslie Cheung Kwok Wing
Vocal Source: Eminem - Rap God

Method


We propose OSA-LCM, a two-stage training method for generating portrait videos in a single sampling step. Both stages train an adversarial latent consistency model (Adv-LCM) with a consistency loss and an adversarial loss driven by a novel discriminator. In the second stage, we apply a novel editing fine-tuning technique (EFT) to further improve the quality of videos generated by the LCM in one sampling step.
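For intuition, here is a minimal, self-contained PyTorch sketch of a stage-one Adv-LCM-style objective as described above: a consistency loss between the student's predictions at adjacent timesteps plus an adversarial loss from a discriminator on the one-step prediction. All names (StudentLCM, LatentDiscriminator, adv_lcm_losses), the toy noise schedule, and the loss weighting are illustrative assumptions, not the actual OSA-LCM implementation.

```python
# Sketch of an adversarial latent consistency training objective:
# consistency loss against an EMA target at an earlier timestep,
# plus a generator-side adversarial loss from a latent discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentLCM(nn.Module):
    """Toy stand-in for the latent consistency model f_theta(x_t, t)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t):
        t_embed = t.view(-1, 1).float()
        return self.net(torch.cat([x_t, t_embed], dim=-1))

class LatentDiscriminator(nn.Module):
    """Toy discriminator scoring whether a predicted latent looks clean/real."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(), nn.Linear(128, 1))

    def forward(self, x0_pred):
        return self.net(x0_pred)

def adv_lcm_losses(student, ema_student, disc, x0, noise, t, t_prev, lambda_adv=0.1):
    # Diffuse the clean latent to two adjacent timesteps (crude linear schedule).
    alpha_t = 1.0 - t.view(-1, 1).float() / 1000.0
    alpha_prev = 1.0 - t_prev.view(-1, 1).float() / 1000.0
    x_t = alpha_t * x0 + (1 - alpha_t) * noise
    x_prev = alpha_prev * x0 + (1 - alpha_prev) * noise

    # Consistency loss: the student's output at t should match the EMA target's
    # output at the earlier timestep, keeping the mapping consistent along the trajectory.
    pred_t = student(x_t, t)
    with torch.no_grad():
        target_prev = ema_student(x_prev, t_prev)
    loss_consistency = F.mse_loss(pred_t, target_prev)

    # Adversarial loss: push the one-step prediction to look realistic
    # (non-saturating GAN loss, generator side only in this sketch).
    logits_fake = disc(pred_t)
    loss_adv = F.softplus(-logits_fake).mean()

    return loss_consistency + lambda_adv * loss_adv

# Usage with random tensors, just to show the shapes involved.
dim = 64
student, ema_student, disc = StudentLCM(dim), StudentLCM(dim), LatentDiscriminator(dim)
x0 = torch.randn(8, dim)
noise = torch.randn_like(x0)
t = torch.randint(100, 1000, (8,))
t_prev = t - 50
loss = adv_lcm_losses(student, ema_student, disc, x0, noise, t, t_prev)
loss.backward()
print(loss.item())
```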


Various Generated Videos



Singing

Make Portrait Sing

Given a single character image and a vocal audio clip (singing, speaking, and so on), our method generates vocal avatar videos with expressive facial expressions and varied head poses, using only one sampling step.

Different Language

Our method supports songs in various languages and brings diverse portrait styles to life. It intuitively recognizes tonal variations in the audio, enabling the generation of dynamic, expression-rich avatars.

Japanese

Mandarin

Comparison

Compared with different sampling steps

With only one or two sampling steps, OSA-LCM and Adv-LCM generate portrait videos comparable to those produced by the base model with 20 sampling steps. In addition, OSA-LCM can generate realistic portrait videos at different resolutions in a single sampling step.
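To make the step-count comparison concrete, here is a toy contrast between one-step consistency-style sampling and a conventional multi-step denoising loop. The `model` and the crude interpolation step below are dummy placeholders for illustration, not the actual OSA-LCM inference pipeline.

```python
# One-step sampling: a single model call maps noise to the clean latent.
# Multi-step sampling: one model call per step over a 20-step schedule.
import torch

def one_step_sample(model, x_T, T=999):
    t = torch.full((x_T.shape[0],), T)
    return model(x_T, t)

def multi_step_sample(model, x_T, num_steps=20):
    timesteps = torch.linspace(999, 0, num_steps + 1).long()
    x = x_T
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        pred_x0 = model(x, torch.full((x.shape[0],), int(t_cur)))
        blend = t_next.float() / 999.0  # crude DDIM-like interpolation toward pred_x0
        x = blend * x + (1 - blend) * pred_x0
    return x

# Dummy "model" that just shrinks the latent, to make the snippet runnable.
model = lambda x, t: 0.5 * x
x_T = torch.randn(2, 64)
print(one_step_sample(model, x_T).shape, multi_step_sample(model, x_T).shape)
```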

Character: Male, generated by FLUX, 512 × 896

Character: Female, generated by FLUX, 896 × 512

Character: Female in the HDTF

Character: Trump