Character: Generated by FLUX
Vocal Source: Online courses for legal exams
We propose OSA-LCM, a two-stage training method for generating portrait videos in a single sampling step. Both stages train adversarial latent consistency models (Adv-LCM) using a consistency loss and an adversarial loss with a novel discriminator. In the second stage, we apply a novel technique called editing fine-tuned methods (EFT) to further improve the generation quality of the LCM in one sampling step.
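To make the combination of the two losses concrete, here is a minimal sketch of how a generator-side objective could add a consistency term and an adversarial term. It is an illustration under stated assumptions, not the OSA-LCM implementation: the squared-error form of the consistency loss, the non-saturating adversarial loss, the weight `adv_weight`, and all function names are hypothetical, and the discriminator architecture and EFT are not modeled here.

```python
import math


def consistency_loss(student_pred, target_pred):
    """Mean squared error between student and target predictions (assumed form)."""
    return sum((s - t) ** 2 for s, t in zip(student_pred, target_pred)) / len(student_pred)


def generator_adversarial_loss(disc_logits_on_fake):
    """Non-saturating generator loss, i.e. the mean of softplus(-logit) (assumed form)."""
    def softplus(x):
        # Numerically stable log(1 + exp(x)).
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    return sum(softplus(-z) for z in disc_logits_on_fake) / len(disc_logits_on_fake)


def adv_lcm_generator_loss(student_pred, target_pred, disc_logits_on_fake, adv_weight=0.1):
    """Weighted sum of the two terms; adv_weight is a hypothetical hyperparameter."""
    return (consistency_loss(student_pred, target_pred)
            + adv_weight * generator_adversarial_loss(disc_logits_on_fake))
```

In this sketch, `student_pred` and `target_pred` stand in for flattened latent predictions and `disc_logits_on_fake` for the discriminator's scores on generated samples. A real training loop would compute these with tensors and update the discriminator in alternation.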
Given a single character image and a vocal audio clip, such as singing or speech, our method generates vocal avatar videos with expressive facial expressions and varied head poses, using only one sampling step.
Our method supports songs in various languages and brings diverse portrait styles to life. It intuitively recognizes tonal variations in the audio, enabling the generation of dynamic, expression-rich avatars.
With only one or two sampling steps, OSA-LCM and Adv-LCM generate portrait videos similar to those produced by the base model with 20 sampling steps. OSA-LCM can also generate realistic portrait videos at different resolutions in one sampling step.
Character: Male generated by FLUX, 512 × 896
Character: Female generated by FLUX, 896 × 512
Character: Female in the HDTF
Character: Trump