MonoHuman: Animatable Human Neural Field from Monocular Video

CVPR 2023

Zhengming Yu1, Wei Cheng1,2, Xian Liu3, Wayne Wu2, Kwan-Yee Lin2,3†
1SenseTime Research, 2Shanghai AI Laboratory
3The Chinese University of Hong Kong

MonoHuman reconstructs an animatable digital avatar from a monocular video.


Animating virtual avatars with free-view control is crucial for applications such as virtual reality and digital entertainment. Previous studies attempt to utilize the representation power of neural radiance fields (NeRF) to reconstruct the human body from monocular videos. Recent works propose to graft a deformation network into the NeRF to further model the dynamics of the human neural field for animating vivid human motions. However, such pipelines either rely on pose-dependent representations or fall short of motion coherency due to frame-independent optimization, making it difficult to generalize realistically to unseen pose sequences.

In this paper, we propose a novel framework, MonoHuman, which robustly renders view-consistent and high-fidelity avatars under arbitrary novel poses. Our key insight is to model the deformation field with bi-directional constraints and explicitly leverage the off-the-peg keyframe information to reason about feature correlations for coherent results. In particular, we first propose a Shared Bidirectional Deformation module, which creates a pose-independent generalizable deformation field by disentangling backward and forward deformation correspondences into a shared skeletal motion weight and separate non-rigid motions. Then, we devise a Forward Correspondence Search module, which queries the correspondence features of keyframes to guide the rendering network. The rendered results are thus multi-view consistent with high fidelity, even under challenging novel pose settings. Extensive experiments demonstrate the superiority of our proposed MonoHuman over state-of-the-art methods.
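To make the idea of a shared skeletal motion weight more concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes a toy skeleton with random rigid bone transforms, a hand-written softmax weight field in canonical space (`canonical_weights`, with hypothetical bone centers), and a thresholded cycle distance as the bi-directional constraint (`threshold` is an illustrative value). The separate non-rigid motion terms are omitted entirely. Both directions read their blend weights from the same canonical field, which is the property the sketch is meant to illustrate.

```python
import numpy as np

def rigid_transforms(num_bones, rng):
    """Random per-bone rotations about z plus translations (illustrative only)."""
    angles = rng.uniform(-0.5, 0.5, size=num_bones)
    R = np.zeros((num_bones, 3, 3))
    R[:, 0, 0] = np.cos(angles)
    R[:, 0, 1] = -np.sin(angles)
    R[:, 1, 0] = np.sin(angles)
    R[:, 1, 1] = np.cos(angles)
    R[:, 2, 2] = 1.0
    t = rng.uniform(-0.1, 0.1, size=(num_bones, 3))
    return R, t

def canonical_weights(x_c, centers, sharpness=10.0):
    """Shared weight field in canonical space: softmax over squared distance
    to hypothetical bone centers. Returns an (N, K) array whose rows sum to 1."""
    d2 = ((x_c[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    logits = -sharpness * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def forward_deform(x_c, R, t, centers):
    """Canonical -> observation: blend R_k x + t_k with the shared weights."""
    w = canonical_weights(x_c, centers)                 # (N, K)
    moved = np.einsum('kij,nj->nki', R, x_c) + t[None]  # (N, K, 3)
    return (w[..., None] * moved).sum(axis=1)

def backward_deform(x_o, R, t, centers):
    """Observation -> canonical: undo each bone's motion, then weight each
    candidate by the same canonical field sampled at that candidate."""
    cand = np.einsum('kji,nkj->nki', R, x_o[:, None, :] - t[None])  # R_k^T (x - t_k)
    w = np.stack(
        [canonical_weights(cand[:, k], centers)[:, k] for k in range(len(R))],
        axis=1,
    )
    w = w / np.clip(w.sum(axis=1, keepdims=True), 1e-8, None)
    return (w[..., None] * cand).sum(axis=1)

def consistency_loss(x_o, R, t, centers, threshold=0.05):
    """Cycle x_o -> canonical -> observation; penalize only large deviations."""
    x_c = backward_deform(x_o, R, t, centers)
    x_o_cycle = forward_deform(x_c, R, t, centers)
    dist = np.linalg.norm(x_o_cycle - x_o, axis=1)
    return float(np.where(dist >= threshold, dist, 0.0).mean())

rng = np.random.default_rng(0)
centers = rng.normal(scale=0.2, size=(4, 3))   # hypothetical bone centers
R, t = rigid_transforms(4, rng)
samples = rng.normal(scale=0.2, size=(8, 3))   # points in observation space
print("cycle consistency loss:", consistency_loss(samples, R, t, centers))
```

Because linear blend skinning is not exactly invertible when several bones influence a point, the cycle distance is generally non-zero, which is why a constraint of this kind can act as a regularizer during training.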


Novel View Synthesis Results

Animated by Text Generated Poses

Text: The person is doing backflip.

Text: A person punches in a manner consistent with martial arts.

Text: A person first kick something with his left leg and then right leg.

Text: A person run in a circle.

*We use the MDM model to generate poses from text.

Animated by Music Poses

Jazz Music

*We borrow music poses from the AIST++ dataset.

Pop Music

In-the-wild Video

*The video is randomly selected from the internet.

Related Links

Our work is inspired by many excellent works.

HumanNeRF introduces an approach to free-viewpoint rendering of any frame from a monocular video. MoCo-Flow uses separate MLPs to model forward and backward deformation and synthesizes novel views of dynamic humans from stationary monocular cameras. MDM generates realistic pose sequences from text input. IBRNet and NeuRay introduce ideas on how to utilize image features from images of different views.

We thank the authors of these excellent works for their open-source contributions.


  title={MonoHuman: Animatable Human Neural Field from Monocular Video},
  author={Yu, Zhengming and Cheng, Wei and Liu, Xian and Wu, Wayne and Lin, Kwan-Yee},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},