VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

Fudan University, Tencent

VividPose turns static images into high-fidelity and temporally consistent animation videos.

Abstract

Human image animation involves generating a video from a static image by following a specified pose sequence. Current approaches typically adopt a multi-stage pipeline that learns appearance and motion separately, which often leads to appearance degradation and temporal inconsistencies. To address these issues, we propose VividPose, an end-to-end pipeline based on Stable Video Diffusion (SVD) that ensures superior temporal stability. To enhance the retention of human identity, we propose an identity-aware appearance controller that integrates additional facial information without compromising other appearance details such as clothing texture and background. This ensures that the generated videos remain faithful to the human subject's identity, preserving key facial features across diverse poses. To accommodate diverse human body shapes and hand movements, we introduce a geometry-aware pose controller that utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and body shape in the generated videos and provides a robust framework capable of handling a wide range of body shapes and dynamic hand movements. Extensive qualitative and quantitative experiments on the UBCFashion and TikTok benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, VividPose exhibits superior generalization on our proposed in-the-wild dataset. Code and models will be made available.
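
To make the geometry-aware pose conditioning concrete, the sketch below shows one way a per-frame dense SMPL-X rendering map and a sparse skeleton map could be fused and injected into the noisy video latents. This is a minimal PyTorch sketch under our own assumptions (module name, channel counts, additive fusion, and the downsampling factor are illustrative), not the released implementation.

import torch
import torch.nn as nn

class GeometryAwarePoseControllerSketch(nn.Module):
    # Hypothetical pose controller: concatenates a dense SMPL-X rendering map
    # and a sparse skeleton map along channels, downsamples them to the latent
    # resolution, and adds the result to the noisy video latents.
    def __init__(self, latent_channels: int = 4, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, hidden, kernel_size=3, stride=2, padding=1),  # 3 render + 3 skeleton channels
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, kernel_size=3, padding=1),
        )

    def forward(self, render_maps, skeleton_maps, noisy_latents):
        # render_maps, skeleton_maps: (B*F, 3, H, W); noisy_latents: (B*F, 4, H/8, W/8)
        pose_cond = torch.cat([render_maps, skeleton_maps], dim=1)
        return noisy_latents + self.encoder(pose_cond)

Frames are folded into the batch dimension here for simplicity; temporal modeling is left to the SVD backbone that consumes the fused latents.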

Method


Overview of VividPose. The denoising UNet (i.e., SVD) consists of Spatial-Attention, Cross-Attention, Temporal-Conv, and Temporal-Attention blocks. We utilize ReferenceNet to encode multi-scale features from the reference image, which are injected into the denoising UNet via spatial self-attention. We also use CLIP to encode high-level appearance features and an ID Controller (i.e., ArcFace) to encode face identity features; these two feature sets are injected via decoupled cross-attention. The compositional pose sequences consist of a skeleton map sequence extracted by DWPose and a rendering map sequence extracted by SMPLer-X. These sequences are first encoded by the Pose Controller and then fused with the noisy video frame latents. Finally, the reference image latent, produced by the VAE encoder, is concatenated with the noisy latents.
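
As an illustration of the decoupled cross-attention used to inject the CLIP and ArcFace features, here is a minimal PyTorch sketch: a shared query from the UNet hidden states attends to the two token streams through separate key/value projections, and the two results are summed with a tunable scale on the identity branch. Dimensions, head count, and the scale factor are illustrative assumptions, not the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttentionSketch(nn.Module):
    # Hypothetical decoupled cross-attention: one query projection, but
    # separate key/value projections for CLIP image tokens and ArcFace
    # identity tokens; the two attention outputs are summed.
    def __init__(self, dim=320, clip_dim=1024, id_dim=512, heads=8, id_scale=1.0):
        super().__init__()
        self.heads, self.id_scale = heads, id_scale
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv_clip = nn.Linear(clip_dim, 2 * dim, bias=False)
        self.to_kv_id = nn.Linear(id_dim, 2 * dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q, kv):
        k, v = kv.chunk(2, dim=-1)
        B, N, D = q.shape
        h, d = self.heads, D // self.heads
        q = q.view(B, N, h, d).transpose(1, 2)
        k = k.view(B, -1, h, d).transpose(1, 2)
        v = v.view(B, -1, h, d).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(B, N, D)

    def forward(self, hidden_states, clip_tokens, id_tokens):
        # hidden_states: (B, N, dim); clip_tokens: (B, Tc, clip_dim); id_tokens: (B, Ti, id_dim)
        q = self.to_q(hidden_states)
        out_clip = self._attend(q, self.to_kv_clip(clip_tokens))
        out_id = self._attend(q, self.to_kv_id(id_tokens))
        return self.to_out(out_clip + self.id_scale * out_id)

ArcFace yields a single identity embedding per face, so id_tokens would typically have length one (or be expanded by a small projector before attention).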

Make Human Images Vivid

VividPose demonstrates excellent performance on both real-world human images and cartoon images. Note that the cartoon domain is unseen during training.

Make Human Images Fashionable

VividPose demonstrates excellent performance on the UBCFashion dataset.

Make Human Images Dance

VividPose demonstrates excellent performance on the TikTok dataset.

BibTeX

@article{wang2024vividpose,
      title={VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation}, 
      author={Qilin Wang and Zhengkai Jiang and Chengming Xu and Jiangning Zhang and Yabiao Wang and Xinyi Zhang and Yun Cao and Weijian Cao and Chengjie Wang and Yanwei Fu},
      journal={arXiv preprint arXiv:2405.18156v1},
      website={https://Kelu007.github.io/vivid-pose/},
      year={2024}
}