Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video

Wen-Li Wei *
Jen-Chun Lin *
Tyng-Luh Liu
Hong-Yuan Mark Liao

* authors contributed equally

Academia Sinica, Taiwan


Overview of our motion pose and shape network (MPS-Net). MPS-Net estimates pose, shape, and camera parameters Θ in the video sequence based on the static feature extractor, temporal encoder, temporal feature integration, and SMPL parameter regressor to generate 3D human pose and shape.

Learning to capture human motion is essential to 3D human pose and shape estimation from monocular video. However, existing methods mainly rely on recurrent or convolutional operations to model such temporal information, which limits their ability to capture non-local context relations of human motion. To address this problem, we propose a motion pose and shape network (MPS-Net) to effectively capture humans in motion and estimate accurate and temporally coherent 3D human pose and shape from a video. Specifically, we first propose a motion continuity attention (MoCA) module that leverages visual cues observed from human motion to adaptively recalibrate the range that needs attention in the sequence, so as to better capture motion continuity dependencies. Then, we develop a hierarchical attentive feature integration (HAFI) module to effectively combine adjacent past and future feature representations, strengthening temporal correlation and refining the feature representation of the current frame. By coupling the MoCA and HAFI modules, the proposed MPS-Net excels in estimating 3D human pose and shape in video. Though conceptually simple, our MPS-Net not only outperforms state-of-the-art methods on the 3DPW, MPI-INF-3DHP, and Human3.6M benchmark datasets, but also uses fewer network parameters.
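To make the contrast with recurrent or convolutional modeling concrete, the following is a minimal NumPy sketch of two generic ideas named in the abstract: attention that lets every frame look at every other frame in the sequence, and an attention-weighted fusion of a frame with its adjacent past and future frames. It is not the MoCA or HAFI module itself. It uses plain scaled dot-product attention with no learned projections, and the function names, feature sizes, and random inputs are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_self_attention(feats):
    """Generic non-local attention over time (hypothetical helper).

    feats: array of shape (T, D), one feature vector per frame.
    Returns the refined features (T, D) and the attention map (T, T).
    """
    _, dim = feats.shape
    # Pairwise similarity between all frames, not only neighbouring ones.
    scores = feats @ feats.T / np.sqrt(dim)
    attn = softmax(scores, axis=-1)
    # Residual connection: each frame is refined by a weighted sum of all frames.
    return feats + attn @ feats, attn


def attentive_integration(prev_feat, cur_feat, next_feat):
    """Fuse past, current, and future features with softmax weights (hypothetical helper)."""
    stacked = np.stack([prev_feat, cur_feat, next_feat])       # (3, D)
    scores = stacked @ cur_feat / np.sqrt(cur_feat.shape[0])   # (3,)
    weights = softmax(scores)
    return weights @ stacked, weights


# Illustrative input: 16 frames with 32-dimensional features (sizes are assumptions).
rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 32))

refined, attn = temporal_self_attention(feats)
fused, weights = attentive_integration(refined[7], refined[8], refined[9])
```

Each row of `attn` is a distribution over all 16 frames, which is what distinguishes this kind of operation from a convolution with a fixed local window. A trained model would add learned projections and, per the abstract, a mechanism to recalibrate the attended range; neither is reproduced here.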

Left: The video shows the 3D human pose and shape estimation results of our MPS-Net from different viewpoints. Top-Right: Visual results of the continuity of human motion learned by MPS-Net (MPS-Net produces a transition effect between pose changes). Bottom-Right: MPS-Net result for a person standing still. When the input video shows a person standing still, MPS-Net does not force the subject to be in motion.


[Paper] to appear in CVPR, 2022

[Code] Coming soon!

Demo Videos (MPS-Net)

Original video frame rates of the eight demo clips, in order: 25 fps, 25 fps, 25 fps, 60 fps, 29.97 fps, 23.98 fps, 30 fps, 30 fps.

Visual effects of MPS-Net in alternative viewpoints

We visualize the 3D human body estimated by our MPS-Net from different viewpoints. The results show that our MPS-Net is able to estimate the correct global body rotation.
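Viewing an estimated body from another viewpoint amounts to rotating its mesh vertices before rendering. The sketch below shows one way this could be done with NumPy. The choice of the y axis as the vertical axis, the helper names, and the random stand-in vertices are assumptions for illustration and do not describe the rendering code used for the videos.

```python
import numpy as np


def rotation_about_y(angle_deg):
    """3x3 rotation matrix about the y axis (assumed here to be vertical)."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([
        [c, 0.0, s],
        [0.0, 1.0, 0.0],
        [-s, 0.0, c],
    ])


def view_from(vertices, angle_deg):
    """Rotate mesh vertices of shape (N, 3) about their centroid (hypothetical helper)."""
    center = vertices.mean(axis=0)
    rot = rotation_about_y(angle_deg)
    # Row vectors are rotated by right-multiplying with the transpose.
    return (vertices - center) @ rot.T + center


# Stand-in for an estimated body mesh; an SMPL mesh has 6890 vertices.
verts = np.random.default_rng(0).standard_normal((6890, 3))

side_view = view_from(verts, 90.0)
full_turn = view_from(verts, 360.0)
```

If the estimated global body rotation is wrong, the error is often hard to see from the camera view but becomes obvious in a rotated view like `side_view`, which is why alternative viewpoints are a useful check.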

Qualitative Results-1

Visual comparison between MPS-Net using only the MoCA module (MPS-Net-only MoCA) and VIBE [20]

Qualitative Results-2

Visual comparison between our MPS-Net (final version) and TCMR [6]

Qualitative comparison of TCMR [6] (left) and our MPS-Net (right) on the challenging in-the-wild 3DPW [37] and MPI-INF-3DHP [27] datasets.

Qualitative comparison of TCMR [6] (middle row) and our MPS-Net (bottom row) on the challenging in-the-wild 3DPW [37] dataset.

Qualitative comparison of TCMR [6] (middle row) and our MPS-Net (bottom row) on the challenging in-the-wild 3DPW [37] dataset.

Qualitative comparison of TCMR [6] (middle row) and our MPS-Net (bottom row) on the MPI-INF-3DHP [27] dataset.

Qualitative comparison of TCMR [6] (middle row) and our MPS-Net (bottom row) on the Human3.6M [16] dataset.

Quantitative Results

Table 1. Evaluation of state-of-the-art video-based methods on the 3DPW [37], MPI-INF-3DHP [27], and Human3.6M [16] datasets. Following Choi et al. [6], all methods are trained on a training set that includes 3DPW, but do not use the Human3.6M SMPL parameters obtained from MoSh [23]. The number of input frames follows the original protocol of each method.

Table 2. Comparison of the number of network parameters, FLOPs, and model size.

Table 3. Ablation study for different modules of the MPS-Net on the 3DPW [37] dataset. The training and evaluation settings are the same as the experiments on the 3DPW dataset in Table 1.

Table 4. Evaluation of state-of-the-art single image-based and video-based methods on the 3DPW [37] dataset. None of the methods uses 3DPW for training.
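The table contents are not reproduced in the text here, so as a reading aid the sketch below shows two metrics that are commonly reported on these benchmarks: a per-frame accuracy measure (mean per-joint position error) and a temporal-coherence measure (acceleration error). That the tables use exactly these metrics is an assumption, as are the function names, the joint count, and the synthetic data. Protocol details such as root alignment or Procrustes alignment are left out.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error for arrays of shape (T, J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()


def accel_error(pred, gt):
    """Mean difference between predicted and ground-truth joint accelerations.

    Acceleration is approximated by the second finite difference over time,
    so at least three frames are required.
    """
    acc_pred = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    acc_gt = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    return np.linalg.norm(acc_pred - acc_gt, axis=-1).mean()


# Synthetic ground truth: a smooth random walk for 30 frames and 14 joints (assumed sizes).
rng = np.random.default_rng(0)
gt = np.cumsum(rng.standard_normal((30, 14, 3)) * 0.01, axis=0)

# A prediction with a constant offset: inaccurate per frame but perfectly smooth.
shifted = gt + np.array([0.01, 0.0, 0.0])

# A prediction with independent per-frame noise: jittery over time.
jittery = gt + rng.standard_normal(gt.shape) * 0.01

print("MPJPE   shifted:", mpjpe(shifted, gt), " jittery:", mpjpe(jittery, gt))
print("Accel   shifted:", accel_error(shifted, gt), " jittery:", accel_error(jittery, gt))
```

The two synthetic predictions illustrate why both kinds of metric matter: a constant offset leaves the acceleration error at essentially zero, while per-frame noise raises it sharply. That temporal quality is what the abstract refers to as temporally coherent estimation.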


[6] Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3D human pose and shape from a video. CVPR, 2021.

[8] Carl Doersch and Andrew Zisserman. Sim2real transfer learning for 3D human pose estimation: Motion to the rescue. NeurIPS, 2019.

[16] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE TPAMI, 36(7):1325–1339, 2014.

[17] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. CVPR, 2018.

[18] Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, and Jitendra Malik. Learning 3D human dynamics from video. CVPR, 2019.

[20] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. CVPR, 2020.

[21] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. ICCV, 2019.

[22] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. CVPR, 2019.

[23] Matthew Loper, Naureen Mahmood, and Michael J. Black. MoSh: Motion and shape capture from sparse markers. ACM ToG, 33(6):220:1–220:13, 2014.

[25] Zhengyi Luo, S. Alireza Golestaneh, and Kris M. Kitani. 3D human motion estimation via motion compression and refinement. ACCV, 2020.

[27] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. 3DV, 2017.

[28] Gyeongsik Moon and Kyoung Mu Lee. I2L-MeshNet: Image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. ECCV, 2020.

[34] Yu Sun, Yun Ye, Wu Liu, Wenpeng Gao, Yili Fu, and Tao Mei. Human mesh recovery from monocular images via a skeleton-disentangled representation. ICCV, 2019.

[37] Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. ECCV, 2018.

[38] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. CVPR, 2018.

[41] Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, Limin Wang, and Zhenan Sun. PyMAF: 3D human pose and shape regression with pyramidal mesh alignment feedback loop. ICCV, 2021.

Contact Information

Wen-Li Wei, Jen-Chun Lin